Abstract. We consider the unsupervised learning problem of assigning 
labels to unlabeled data. A naive approach is to use clustering methods, 
but this works well only when data is properly clustered and each cluster 
corresponds to an underlying class. In this paper, we first show that this 
unsupervised labeling problem in balanced binary cases can be solved if 
two unlabeled datasets having different class balances are available. More 
specifically, estimation of the sign of the difference between probability 
densities of two unlabeled datasets gives the solution. We then introduce 
a new method to directly estimate the sign of the density difference 
without density estimation. Finally, we demonstrate the usefulness of 
the proposed method against several clustering methods on various toy 
problems and real-world datasets. 



1 Introduction 

Gathering labeled data is expensive and time consuming in many practical ma- 
chine learning problems, and therefore class labels are often absent. In this pa- 
per, we consider the problem of labeling, which is aimed at giving a label to 
each sample. Labeling is similar to classification, but it is slightly simpler than 
classification because classes do not have to be specified. That is, labeling just 
tries to split unlabeled samples into disjoint subsets, and class labels such as 
male/female or positive/negative are not assigned to samples. 

A naive approach to the labeling problem is to use a clustering technique 
which is aimed at assigning a label to each sample of the dataset to divide 
the dataset into disjoint clusters. The tacit assumption in clustering is that the 
clusters correspond to the underlying classes. However, this assumption is often 
violated in practical datasets, for example, when clusters are not well separated 
or a dataset exhibits within-class multimodality. 

An example of the labeling problem is illustrated in Figure 1. Figure 1(a) 
denotes the densities of the two classes. Figure 1(b) denotes samples drawn from 
a mixture of the two original densities. Because the two clusters are highly over- 
lapping, it may not be possible to properly label them by a clustering method. 

Fig. 1. Illustrative example of labeling samples from unbalanced datasets. Figures 1(b)
and 1(c) show the samples of the two datasets, which differ only by class balance (the
datasets are denoted as $X_p$ and $X_{p'}$). The discriminant estimated by the method that
we propose in this paper is given in blue and the optimal discriminant is given by the
black dashed line. The true underlying class labels (which are unknown) are illustrated
in red and black.

In this paper we show that if one more dataset with a different class balance
is available (Figure 1(c)), the labeling problem can be solved (Figures 1(d) and
1(e)). More specifically, we show that a labeling for the samples can be obtained
by estimating the sign of the difference between the probability densities of the two
unlabeled datasets. A naive way is to first separately estimate the two densities from
the two sets of samples and then take the sign of their difference to obtain a labeling.
However, this naive procedure violates Vapnik's principle [1]:



If you possess a restricted amount of information for solving some prob- 
lem, try to solve the problem directly and never solve a more general 
problem as an intermediate step. It is possible that the available infor- 
mation is sufficient for a direct solution but is insufficient for solving a 
more general intermediate problem. 

This principle was used in the development of support vector machines (SVMs):
Rather than modeling two classes of samples, an SVM directly learns a decision
boundary that is sufficient for performing pattern recognition.

In the current context, estimating two densities is more general than labeling 
samples. Thus, the above naive scheme may be improved by estimating the 
density difference directly and then taking its sign to obtain the class labels. 
Recently, a method was introduced to directly estimate the density difference, 
called the least-squares density difference (LSDD) estimator [2]. Thus, the use 
of LSDD for labeling is expected to improve the performance. 

However, the LSDD-based procedure is still indirect; directly estimating the 
sign of the density difference would be the most suitable approach to labeling. 
In this paper, we show that the sign of the density difference can be directly
estimated by lower-bounding the $L_1$-distance between probability densities. Based
on this, we give a practical algorithm for labeling and illustrate its usefulness
through experiments on various real-world datasets.

2 Problem Formulation and Fundamental Approaches 

In this section, we formulate the problem of labeling, give our fundamental strat- 
egy, and consider two naive approaches. 

2.1 Problem Formulation 

Suppose that there are two probability distributions $p(x, y)$ and $p'(x, y)$ on
$x \in \mathbb{R}^d$ and $y \in \{1, -1\}$, which are different only in class balances:

\[ p(y) \neq p'(y) \quad \text{but} \quad p(x|y) = p'(x|y). \tag{1} \]

From these distributions, we are given two sets of unlabeled samples:

\[ X_p = \{x_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} p(x) \quad \text{and} \quad X_{p'} = \{x'_j\}_{j=1}^{n'} \overset{\text{i.i.d.}}{\sim} p'(x). \]

The goal of labeling is to obtain a labeling for the two sets of samples, $X_p$ and $X_{p'}$,
that corresponds to the underlying class labels $\{y_i\}_{i=1}^{n}$ and $\{y'_j\}_{j=1}^{n'}$. However,
different from classification, we do not obtain correct class labels; we obtain
the correct class separation up to label commutation.



2.2 Fundamental strategy 

We wish to obtain a labeling for samples in $X_p$ and $X_{p'}$. Here we show that we
can obtain the solution for the case where the class priors are equal. We may
write the class-posterior distribution for the equal prior case as

\[ q(y|x) = \frac{p(x|y)\, q(y)}{q(x)}, \]

where $q(y = 1) = q(y = -1) = \frac{1}{2}$. A class label can then be assigned to a point
by evaluating

\[ \mathrm{sign}\left[ q(y = 1|x) - q(y = -1|x) \right]. \]

We can write the criterion as

\[ q(y = 1|x) - q(y = -1|x) = \frac{p(x|y = 1)\,\frac{1}{2}}{q(x)} - \frac{p(x|y = -1)\,\frac{1}{2}}{q(x)} \propto p(x|y = 1) - p(x|y = -1). \]

We do not have any labeled samples to calculate $p(x|y = 1) - p(x|y = -1)$, but
we can rewrite it in terms of marginal distributions. To see this, the above is
multiplied by $p(y = 1) - p'(y = 1)$, which, using $p(x|y) = p'(x|y)$, gives

\[ p(x|y = 1) - p(x|y = -1) \propto [p(y = 1) - p'(y = 1)]\,[p(x|y = 1) - p(x|y = -1)] \]
\[ = p(x, y = 1) - p'(x, y = 1) - p(y = 1)\,p(x|y = -1) + p'(y = 1)\,p(x|y = -1). \]

Note that the sign may change since $p(y = 1) - p'(y = 1)$ may be positive or
negative. To write the third and fourth terms as joint distributions, we add and
subtract $p(x|y = -1)$, giving

\[ p(x|y = 1) - p(x|y = -1) \propto p(x, y = 1) - p'(x, y = 1) + [1 - p(y = 1)]\,p(x|y = -1) - [1 - p'(y = 1)]\,p(x|y = -1). \]

Since $p(y = -1) = 1 - p(y = 1)$ and $p'(y = -1) = 1 - p'(y = 1)$, the last two terms
are $p(x, y = -1)$ and $p'(x, y = -1)$, and we can express the above as

\[ q(y = 1|x) - q(y = -1|x) \propto p(x) - p'(x). \]

The exact class labels cannot be recovered since the term $p(y = 1) - p'(y = 1)$
can be positive or negative. Therefore, we assign the label $\hat{y} \in \{1, -1\}$ to a point
$x$ according to the following criterion:

\[ \hat{y} = \mathrm{sign}\left[ p(x) - p'(x) \right]. \tag{2} \]

Thus, we now need a good method to estimate $\mathrm{sign}[p(x) - p'(x)]$.
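The criterion in Eq. (2) can be checked numerically when the densities are known. The sketch below is only an illustration under assumed settings: one-dimensional Gaussian class-conditionals with means -1 and 1, and the helper names `marginal` and `label_by_density_difference` are hypothetical. It shows that the sign of the marginal difference reproduces the equal-prior Bayes labels, up to a global flip that depends on which dataset has the larger positive-class prior.

```python
import math

def normal_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def marginal(x, prior_pos, mean_pos=-1.0, mean_neg=1.0):
    """Marginal density prior_pos * p(x|y=1) + (1 - prior_pos) * p(x|y=-1)."""
    return prior_pos * normal_pdf(x, mean_pos) + (1.0 - prior_pos) * normal_pdf(x, mean_neg)

def sign(v):
    return 1 if v >= 0 else -1

def label_by_density_difference(x, prior_p, prior_p_dash):
    """Eq. (2): sign[p(x) - p'(x)], here with known (assumed) marginals."""
    return sign(marginal(x, prior_p) - marginal(x, prior_p_dash))

def bayes_label_equal_prior(x, mean_pos=-1.0, mean_neg=1.0):
    """Sign of p(x|y=1) - p(x|y=-1), the equal-prior Bayes decision."""
    return sign(normal_pdf(x, mean_pos) - normal_pdf(x, mean_neg))

# Evaluation grid from -3 to 3 that avoids the decision boundary x = 0.
grid = [-3.0 + 0.4 * k for k in range(16)]
labels_a = [label_by_density_difference(x, 0.7, 0.3) for x in grid]
labels_b = [label_by_density_difference(x, 0.3, 0.7) for x in grid]
bayes = [bayes_label_equal_prior(x) for x in grid]
```

Swapping the two priors flips every label, which is the label commutation mentioned in Section 2.1.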



2.3 Kernel Density Estimation 

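A naive approach to estimating the sign of the density difference is to use kernel
density estimators (KDEs) [3]. For Gaussian kernels, the KDE solutions are
given by

\[ \hat{p}(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) \quad \text{and} \quad \hat{p}'(x) = \frac{1}{n' (2\pi\sigma'^2)^{d/2}} \sum_{j=1}^{n'} \exp\left( -\frac{\|x - x'_j\|^2}{2\sigma'^2} \right). \]

The Gaussian widths $\sigma$ and $\sigma'$ may be determined based on least-squares
cross-validation [4]. Finally, a labeling is obtained as

\[ \hat{y} = \mathrm{sign}\left[ \hat{p}(x) - \hat{p}'(x) \right]. \tag{3} \]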
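A minimal sketch of the KDE-based labeling in Eq. (3) follows. It is one-dimensional and uses fixed, hand-picked bandwidths rather than least-squares cross-validation, and the sample values and function names are illustrative assumptions, not taken from the paper.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian KDE, including its normalising constant."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def p_hat(x):
        return norm * sum(
            math.exp(-((x - s) ** 2) / (2.0 * bandwidth ** 2)) for s in samples
        )

    return p_hat

def kde_labels(points, x_p, x_p_dash, bw_p, bw_p_dash):
    """Label each point by the sign of the difference of the two KDEs."""
    p_hat = gaussian_kde(x_p, bw_p)
    p_dash_hat = gaussian_kde(x_p_dash, bw_p_dash)
    return [1 if p_hat(x) - p_dash_hat(x) >= 0 else -1 for x in points]

# Tiny hand-made datasets (illustrative values only).
x_p = [-2.1, -1.7, -1.2, -0.9, -0.4, 0.8, 1.5]
x_p_dash = [-1.3, 0.3, 0.9, 1.1, 1.6, 2.0, 2.4]
labels = kde_labels([-2.0, 2.0], x_p, x_p_dash, 0.5, 0.5)
```

Because the two estimates are subtracted, the normalising constants matter whenever the sample sizes or bandwidths differ, which is why they are kept explicit here.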



2.4 Direct Estimation of the Density Difference 

KDE is a nice density estimator, but it is not necessarily suitable in density- 
difference estimation, because small estimation error incurred in each density 
estimate can cause a big error in the final density-difference estimate. More 
intuitively, good density estimators tend to be smooth and thus a density- 
difference estimator obtained from such smooth density estimators tends to be 
over-smoothed [5,6]. 

The density difference can be estimated in a single shot using the least-squares
density difference (LSDD) approach [2]. In this approach, we directly fit a model
$g(x)$ to the density difference under the square loss:

\[ \hat{g} = \operatorname*{argmin}_{g} \frac{1}{2} \int \left( g(x) - (p(x) - p'(x)) \right)^2 dx, \]

which can be efficiently obtained for a kernel density-difference model. A
comprehensive review of LSDD is provided in Appendix B. Finally, a labeling is
obtained as

\[ \hat{y} = \mathrm{sign}\left[ \hat{g}(x) \right]. \]



3 Direct Estimation of the Sign of the Density Difference 

We expect that LSDD gives an improved solution over KDEs due to its more direct
nature. However, LSDD is still indirect because the sign of the density difference
is inspected only after the density difference itself has been estimated. In this
section, we show how to directly estimate the sign of the density difference.



3.1 Derivation of the Objective Function 

By lower-bounding the $L_1$-distance between probability densities, defined as

\[ \int |p(x) - p'(x)| \, dx, \tag{4} \]

we can obtain the sign of the density difference. We begin by considering the
following self-evident relation:

\[ |t| \ge tz \quad \text{if } |z| \le 1. \]

We can apply this relation at each point $x$ to obtain

\[ |p(x) - p'(x)| \ge g(x)\,[p(x) - p'(x)] \quad \text{if } |g(x)| \le 1, \ \forall x. \]

By applying the above inequality to Eq. (4) and maximizing with respect to $g(x)$,
we can obtain the tightest lower bound as

\[ \int |p(x) - p'(x)| \, dx \ge \sup_{g} \int g(x)\,[p(x) - p'(x)] \, dx \quad \text{s.t. } |g(x)| \le 1, \ \forall x. \tag{5} \]

It is straightforward to verify that the above relation is met with equality
when

\[ g(x) = \mathrm{sign}\left( p(x) - p'(x) \right). \]

What makes the expression on the right-hand side of Eq. (5) especially useful
is that the probability densities occur linearly in the integral. By replacing the
integrals with sample averages, searching $g(x)$ from a parametric family
(denoted as $g_{\alpha}(x)$), and turning the maximization into a minimization of the
negated bound, we can write the above as

\[ \hat{\alpha} = \operatorname*{argmin}_{\alpha} \ \frac{1}{n'} \sum_{i=1}^{n'} g_{\alpha}(x'_i) - \frac{1}{n} \sum_{j=1}^{n} g_{\alpha}(x_j) \quad \text{s.t. } |g_{\alpha}(x)| \le 1, \ \forall x. \tag{6} \]
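The lower bound in Eq. (5) can be illustrated numerically. The sketch below assumes two one-dimensional Gaussian mixtures with known densities (priors 0.3 and 0.7, component means -1 and 1) and a simple midpoint-rule integrator. All of these are assumptions made for illustration. It compares the bound for three choices of $g$ with $|g(x)| \le 1$.

```python
import math

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def p(x):
    """Marginal with p(y = 1) = 0.3 (assumed toy density)."""
    return 0.3 * normal_pdf(x, -1.0) + 0.7 * normal_pdf(x, 1.0)

def p_dash(x):
    """Marginal with p'(y = 1) = 0.7 (assumed toy density)."""
    return 0.7 * normal_pdf(x, -1.0) + 0.3 * normal_pdf(x, 1.0)

def integrate(f, lo=-8.0, hi=8.0, steps=4000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

def bound(g):
    """Right-hand side of Eq. (5) for a fixed function g."""
    return integrate(lambda x: g(x) * (p(x) - p_dash(x)))

l1_distance = integrate(lambda x: abs(p(x) - p_dash(x)))
bound_sign = bound(lambda x: 1.0 if p(x) - p_dash(x) >= 0 else -1.0)
bound_tanh = bound(math.tanh)         # a smooth g with |g| < 1
bound_const = bound(lambda x: 1.0)    # a g that ignores x entirely
```

Only the sign function attains the $L_1$-distance. The smooth choice gives a strictly smaller value, and the constant choice gives a bound of zero because both densities integrate to one.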



3.2 Optimization 

Here we briefly discuss how to solve the problem in Eq. (6). A more detailed 
explanation is given in Appendix A. 

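The function in Eq. (6) should satisfy the constraint $|g(x)| \le 1, \forall x$. We can
consider a clipped version of the function that always satisfies the constraint,

\[ \widetilde{g}(x) = R(g(x)), \quad \text{where } R(z) = \begin{cases} 1 & z > 1, \\ -1 & z < -1, \\ z & \text{otherwise.} \end{cases} \]

We use a linear-in-parameter model,

\[ g(x) = \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x), \]

where $\varphi_\ell(x)$ are the basis functions. Using the above definitions, we can rewrite
Eq. (6) as

\[ J(\boldsymbol{\alpha}) = \frac{1}{n'} \sum_{i=1}^{n'} R\left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x'_i) \right) - \frac{1}{n} \sum_{j=1}^{n} R\left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j) \right) + \frac{\lambda}{2} \sum_{\ell=1}^{b} \alpha_\ell^2, \tag{7} \]

where $\frac{\lambda}{2} \sum_{\ell=1}^{b} \alpha_\ell^2$ is a regularization term. Although the above is a non-convex
problem, we can efficiently find a local optimal solution using the convex-concave
procedure (CCCP) [7] (also known as difference of convex (d.c.) programming
[8]). The CCCP procedure requires the objective function to be split into a
convex and a concave part,

\[ J(\boldsymbol{\alpha}) = J_{\mathrm{vex}}(\boldsymbol{\alpha}) + J_{\mathrm{cave}}(\boldsymbol{\alpha}). \]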

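The concave part is then upper-bounded as

\[ J_{\mathrm{cave}}(\boldsymbol{\alpha}) \le \bar{J}_{\mathrm{cave}}(\boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{c}), \]

where the bound is specified by $\boldsymbol{b}$ and $\boldsymbol{c}$ (details are given in Appendix A). This
bound is convex w.r.t. $\boldsymbol{b}$ and $\boldsymbol{c}$ if $\boldsymbol{\alpha}$ is fixed. Using this bound, the objective
can then be bounded as

\[ J(\boldsymbol{\alpha}) \le J_{\mathrm{vex}}(\boldsymbol{\alpha}) + \bar{J}_{\mathrm{cave}}(\boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{c}). \]

The strategy to minimize $J(\boldsymbol{\alpha})$ is then to alternately minimize the right-hand
side w.r.t. $\boldsymbol{\alpha}$ (keeping $\boldsymbol{b}$ and $\boldsymbol{c}$ constant) and w.r.t. $\boldsymbol{b}$ and $\boldsymbol{c}$
(keeping $\boldsymbol{\alpha}$ constant). Minimization w.r.t. $\boldsymbol{\alpha}$ minimizes the current upper
bound, and minimization w.r.t. $\boldsymbol{b}$ and $\boldsymbol{c}$ corresponds to tightening the bound at
the current point.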

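Minimization w.r.t. $\boldsymbol{b}$ and $\boldsymbol{c}$ can be performed by

\[ b_i = \begin{cases} 0 & \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x'_i) < 1, \\ 1 & \text{otherwise,} \end{cases} \quad \text{and} \quad c_j = \begin{cases} 0 & \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j) < -1, \\ 1 & \text{otherwise.} \end{cases} \tag{8} \]

Minimization of the upper bound (assuming $\boldsymbol{b}$ and $\boldsymbol{c}$ are constant) can be
performed by solving the following convex quadratic problem:

\[ \min_{\boldsymbol{\alpha}, \boldsymbol{\xi}, \boldsymbol{\zeta}} \ \frac{1}{n'} \sum_{i=1}^{n'} \xi_i + \frac{1}{n} \sum_{j=1}^{n} \zeta_j - \sum_{\ell=1}^{b} \alpha_\ell \left( \frac{1}{n'} \sum_{i=1}^{n'} b_i \varphi_\ell(x'_i) + \frac{1}{n} \sum_{j=1}^{n} c_j \varphi_\ell(x_j) \right) + \frac{\lambda}{2} \sum_{\ell=1}^{b} \alpha_\ell^2 \]

\[ \text{s.t.} \quad \xi_i \ge 0, \quad \xi_i \ge \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x'_i) + 1, \quad \forall i = 1, \ldots, n', \tag{9} \]

\[ \phantom{\text{s.t.}} \quad \zeta_j \ge 0, \quad \zeta_j \ge \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j) - 1, \quad \forall j = 1, \ldots, n. \]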



The above constrained problem can be solved with an off-the-shelf QP solver. 
Our final optimization algorithm is summarized below: 



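1. Initialize the starting value:

   \[ \boldsymbol{\alpha}^{1} \leftarrow \operatorname*{argmin}_{\boldsymbol{\alpha}} J_{\mathrm{vex}}(\boldsymbol{\alpha}). \]

2. For $t = 1, \ldots, T$:

   (a) Tighten the upper bound: Obtain $\boldsymbol{b}_t$ and $\boldsymbol{c}_t$ as

       \[ \boldsymbol{b}_t, \boldsymbol{c}_t \leftarrow \operatorname*{argmin}_{\boldsymbol{b}, \boldsymbol{c}} \bar{J}_{\mathrm{cave}}(\boldsymbol{\alpha}^{t}, \boldsymbol{b}, \boldsymbol{c}), \]

       by using Eq. (8).

   (b) Minimize the upper bound: Set

       \[ \boldsymbol{\alpha}^{t+1} \leftarrow \operatorname*{argmin}_{\boldsymbol{\alpha}} J_{\mathrm{vex}}(\boldsymbol{\alpha}) + \bar{J}_{\mathrm{cave}}(\boldsymbol{\alpha}, \boldsymbol{b}_t, \boldsymbol{c}_t) \]

       by solving the convex problem in Eq. (9).

In practice, Gaussian kernels centered at the sample points in $X_p$ and $X_{p'}$ are
chosen as the basis functions. All hyper-parameters are set by cross-validation.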
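The alternating procedure above can be sketched in a few lines. This is not the authors' implementation. As a stated simplification, the convex step is solved approximately by plain subgradient descent instead of a QP solver, the data are one-dimensional and well separated, and the kernel width, regularization strength, step size and iteration counts are fixed by hand rather than by cross-validation. The function names (`dsdd_fit`, `tighten`, `minimise_bound`) are hypothetical.

```python
import math

def features(x, centers, width):
    """Gaussian basis functions centred at the given points."""
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]

def g(alpha, phi):
    """Linear-in-parameter model evaluated on a feature vector."""
    return sum(a * f for a, f in zip(alpha, phi))

def tighten(alpha, phi_dash, phi):
    """Eq. (8): b is indexed by samples of X_p', c by samples of X_p."""
    b = [0.0 if g(alpha, f) < 1.0 else 1.0 for f in phi_dash]
    c = [0.0 if g(alpha, f) < -1.0 else 1.0 for f in phi]
    return b, c

def minimise_bound(alpha, b, c, phi_dash, phi, lam, steps=300, lr=0.1):
    """Approximately minimise the convex upper bound by subgradient descent
    (a stand-in for the QP of Eq. (9))."""
    alpha = list(alpha)
    for _ in range(steps):
        grad = [lam * a for a in alpha]
        for b_i, f in zip(b, phi_dash):
            # Subgradient of C_{-1}(g) - b_i * g for a sample of X_p'.
            w = ((1.0 if g(alpha, f) > -1.0 else 0.0) - b_i) / len(phi_dash)
            grad = [gr + w * f_l for gr, f_l in zip(grad, f)]
        for c_j, f in zip(c, phi):
            # Subgradient of C_1(g) - c_j * g for a sample of X_p.
            w = ((1.0 if g(alpha, f) > 1.0 else 0.0) - c_j) / len(phi)
            grad = [gr + w * f_l for gr, f_l in zip(grad, f)]
        alpha = [a - lr * gr for a, gr in zip(alpha, grad)]
    return alpha

def dsdd_fit(x_p, x_p_dash, width=1.0, lam=0.1, outer=5):
    centers = list(x_p) + list(x_p_dash)
    phi = [features(x, centers, width) for x in x_p]
    phi_dash = [features(x, centers, width) for x in x_p_dash]
    # Step 1: with b = c = 0 the bound reduces to J_vex, giving the initial value.
    alpha = minimise_bound([0.0] * len(centers), [0.0] * len(phi_dash),
                           [0.0] * len(phi), phi_dash, phi, lam)
    # Step 2: alternate between tightening the bound and minimising it.
    for _ in range(outer):
        b, c = tighten(alpha, phi_dash, phi)
        alpha = minimise_bound(alpha, b, c, phi_dash, phi, lam)
    return alpha, centers

# Two well-separated toy sets (illustrative values only).
x_p = [-3.5, -3.0, -2.5]
x_p_dash = [2.5, 3.0, 3.5]
alpha, centers = dsdd_fit(x_p, x_p_dash)
g_left = g(alpha, features(-3.0, centers, 1.0))
g_right = g(alpha, features(3.0, centers, 1.0))
```

On this toy input the fitted function is positive where samples of $X_p$ dominate and negative where samples of $X_{p'}$ dominate, so its sign plays the role of the label in Eq. (2). With an inexact inner solver the monotone decrease that CCCP guarantees for an exact solver is not assured.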



4 Experiments 

We first illustrate the operation of our method and characterize the failures of 
other methods on various toy examples. Then we use real-world benchmark data 
to show the superiority of our algorithm. 

4.1 Numerical Illustration 

Toy Problem 1: We illustrate the problem and our method with a simple 
example. Suppose that the class-conditional densities for the two classes are 
given as 

\[ p(x|y = 1) = \mathcal{N}_x(-1_2, I_{2 \times 2}) \quad \text{and} \quad p(x|y = -1) = \mathcal{N}_x(1_2, I_{2 \times 2}), \]

where $\mathcal{N}_x(\mu, \Sigma)$ denotes the normal density with mean $\mu$ and covariance $\Sigma$ w.r.t.
$x$, $1_2$ is a $2 \times 1$ vector of ones, and $I_{2 \times 2}$ is the $2 \times 2$ identity matrix. We generate 2 sets
of 30 samples with class-priors $p(y = 1) = 0.3$ and $p'(y = 1) = 0.7$, respectively.
The result is illustrated in Figure 1. As can be seen from this example, we are 
able to obtain a labeling of the classes that roughly corresponds to the true 
(unknown) labels of the data. 
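The data-generating process of Toy Problem 1 can be sketched as follows. The helper name `sample_dataset` and the random seed are assumptions made for illustration. The true labels are returned only so that a labeling can be evaluated afterwards, since the labeling methods themselves never see them.

```python
import random

def sample_dataset(n, prior_pos, rng):
    """Draw n points from prior_pos * N(-1_2, I) + (1 - prior_pos) * N(1_2, I)."""
    points, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < prior_pos else -1
        mean = -1.0 if y == 1 else 1.0
        # Identity covariance: the two coordinates are independent unit Gaussians.
        points.append((rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)))
        labels.append(y)
    return points, labels

rng = random.Random(0)
x_p, y_p = sample_dataset(30, 0.3, rng)            # class prior p(y = 1) = 0.3
x_p_dash, y_p_dash = sample_dataset(30, 0.7, rng)  # class prior p'(y = 1) = 0.7
```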



Toy Problem 2: One way to obtain a labeling is to use clustering. The tacit
assumption in clustering is that samples in the same cluster belong to the same
class. This assumption, however, is not always true, for example, when the
class-conditional densities are multimodal. Here we consider a problem with the
following class-conditional densities:

\[ p(x|y = 1) = \tfrac{1}{2} \mathcal{N}_x([3 \ 0]^\top, I_{2 \times 2}) + \tfrac{1}{2} \mathcal{N}_x([-3 \ 0]^\top, I_{2 \times 2}), \]
\[ p(x|y = -1) = \tfrac{1}{2} \mathcal{N}_x([0 \ 3]^\top, I_{2 \times 2}) + \tfrac{1}{2} \mathcal{N}_x([0 \ -3]^\top, I_{2 \times 2}). \]

The two distributions are plotted in Figure 2(a). We can try to obtain a class
label by performing clustering on $X_p \cup X_{p'}$ (see Footnote 1). The results for k-means and spectral
clustering, given in Figures 2(d) and 2(e), show that these methods fail to reveal the
true labeling. On the other hand, the proposed method still gives a reasonable
result (Figure 2(f)).

Fig. 2. Illustration of within-class multimodality and clustering.

Footnote 1: If clustering is performed separately on $X_p$ and $X_{p'}$, we do not know which clusters in
each dataset correspond to the clusters in the other dataset. We also cannot perform
clustering on one dataset and apply it to the other dataset, since most clustering
methods do not give out-of-sample labeling. For these reasons, it makes most sense
to perform clustering on the combined dataset.

4.2 Benchmark Datasets

We compare our method against several competing methods on benchmark
datasets. For each experiment, we constructed the datasets $X_p$ and $X_{p'}$ by drawing
$n$ and $n'$ samples from the positive and negative classes of the datasets
according to the priors $p(y = 1)$ and $p'(y = 1)$. The labeling was then performed
using these two datasets. Since we can obtain a labeling, but cannot determine
the original class labels, we cannot measure the performance using the
misclassification rate directly. Assume that the label assigned to the sample $x_i$ is

\[ \hat{y}_i = \begin{cases} -1 & p(x_i) - p'(x_i) < 0, \\ 1 & \text{otherwise,} \end{cases} \]

and similarly $\hat{y}'_j$ for $x'_j$. The misclassification rate (MCR) assuming that the
current labels are correct is

\[ \mathrm{MCR} := \frac{1}{n + n'} \left( \sum_{i : \hat{y}_i \neq y_i} 1 + \sum_{j : \hat{y}'_j \neq y'_j} 1 \right). \]

The misclassification rate assuming that the labels are the opposite is $1 - \mathrm{MCR}$.
We define the labeling error rate (LER) as

\[ \mathrm{LER} := \min(\mathrm{MCR}, 1 - \mathrm{MCR}). \]

Note that this definition is somewhat more optimistic than using the
misclassification rate. The smaller the dataset is, the lower the error would be for randomly
assigning labels to samples: The expected LER for randomly assigning labels to
samples (with equal probability) is

\[ \mathbb{E}[\mathrm{LER}] = \frac{1}{N 2^{N}} \sum_{k=0}^{N} \binom{N}{k} \min(k, N - k), \quad \text{where } N = n + n'. \]

For $n + n' = 40, 60, 80$, the expected labeling error rate is 0.437, 0.449, 0.456.
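The evaluation measure can be sketched directly. The code below assumes the error is computed over the pooled $n + n'$ samples, and the function names are hypothetical. The second function evaluates the expected LER of fair coin-flip labeling by summing over the binomial distribution of the number of mismatches, which reproduces the three values quoted above.

```python
from math import comb

def labeling_error_rate(assigned, truth):
    """LER = min(MCR, 1 - MCR), computed over the pooled samples."""
    wrong = sum(1 for a, t in zip(assigned, truth) if a != t)
    mcr = wrong / len(truth)
    return min(mcr, 1.0 - mcr)

def expected_random_ler(total):
    """Expected LER when each of `total` samples is labeled by a fair coin flip.

    The number of mismatches is Binomial(total, 1/2) whatever the true labels are.
    """
    weighted = sum(comb(total, k) * min(k, total - k) for k in range(total + 1))
    return weighted / (total * 2 ** total)

truth = [1, 1, 1, -1, -1, -1, 1, -1]
assigned = [-1, -1, -1, 1, 1, 1, 1, -1]   # differs from truth on 6 of 8 samples
ler = labeling_error_rate(assigned, truth)
random_baseline = [round(expected_random_ler(m), 3) for m in (40, 60, 80)]
```

In the small example the assigned labels are mostly the mirror image of the truth, so the MCR is 0.75 while the LER is 0.25, reflecting that a labeling is only judged up to label commutation.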
We compared the following methods:

- Direct Sign Density Difference (DSDD) Estimation (proposed): Directly
  estimate $\mathrm{sign}(p(x) - p'(x))$ using the method described in Section 3.
  Hyperparameters are selected via cross validation.

- Least-Squares Density Difference (LSDD) Estimation: Estimate
  $\mathrm{sign}[p(x) - p'(x)]$ by estimating $p(x) - p'(x)$ using the least-squares fitting
  method [9]. Hyperparameters are selected via cross validation.

- Kernel Density Estimation (KDE): Estimate $\mathrm{sign}[p(x) - p'(x)]$ by
  estimating the densities $p(x)$ and $p'(x)$ with kernel density estimation (KDE).
  Hyperparameters are selected using least-squares cross validation.

- K-Means (KM): Cluster the data into two clusters using the K-means
  algorithm.

- Spectral Clustering (SC): Cluster the data into two clusters using the
  spectral clustering algorithm [10]. The affinity matrix was constructed with
  7 nearest neighbors.

- Squared-loss Mutual Information based Clustering (SMIC): Cluster
  the data according to the SMIC method [11]. SMIC was chosen since it
  provides model selection, avoiding the need for subjective parameter tuning.

We compare the performance of the methods by varying the class balance.
Two class balances were selected: one with a large difference between the two priors
($p(y = 1) = 0.2$ and $p'(y = 1) = 0.8$) and one with a small difference between
the two priors ($p(y = 1) = 0.35$ and $p'(y = 1) = 0.65$). The average labeling
error rate and standard deviation for the two experiments, with $|X_p| = |X_{p'}| = 40$,
are given in Tables 1 and 2.

From the results we see that methods which follow the approach proposed in
Section 2 of estimating the sign of the density difference (i.e., DSDD, LSDD, and
KDE) generally work better than methods using the cluster structure of the data
(i.e., KM, SC and SMIC). The thyroid dataset lends itself to an interpretation of
why these methods work better. The labels in the thyroid dataset correspond to
healthy and diseased. The diseased label arises from either a hyper-functioning
or a hypo-functioning thyroid. These two underlying causes produce within-class
multimodality, which may cause clustering-based methods to fail.

Among the methods which estimate the sign of the density difference, we see
that DSDD generally performs better than LSDD, and LSDD in turn performs
better than KDE. This is as expected since KDE solves a more general problem
than LSDD, and LSDD solves a more general problem than DSDD. This pattern
is even more pronounced in the more difficult case where the class balances are
close to each other (Table 2).




Table 1. Labeling error rate for experiments with a class prior of p(y = 1) = 0.2 and
p'(y = 1) = 0.8. The size of each dataset was |X_p| = 40 and |X_p'| = 40. The best
method in terms of the mean error and comparable methods according to the
two-sided paired t-test at the significance level 5% are specified by bold face. The standard
deviation of the labeling error rate is given in brackets.

Dataset      DSDD         LSDD         KDE          KM           SC           SMIC
australian   .142 (.045)  .174 (.110)  .211 (.126)  .266 (.147)  .381 (.033)  .303 (.103)
banana       .179 (.097)  .170 (.070)  .237 (.147)  .431 (.068)  .427 (.141)  .424 (.141)
diabetes     .246 (.122)  .223 (.079)  .226 (.051)  .372 (.080)  .380 (.094)  .370 (.131)
german       .268 (.059)  .281 (.127)  .211 (.051)  .437 (.114)  .448 (.128)  .439 (.052)
heart        .176 (.051)  .174 (.047)  .211 (.074)  .261 (.131)  .310 (.032)  .327 (.107)
image        .198 (.078)  .206 (.047)  .201 (.049)  .385 (.093)  .351 (.119)  .384 (.135)
ionosphere   .157 (.059)  .184 (.106)  .194 (.123)  .329 (.145)  .319 (.113)  .311 (.174)
saheart      .310 (.093)  .205 (.048)  .238 (.113)  .422 (.121)  .395 (.113)  .384 (.072)
thyroid      .102 (.052)  .121 (.116)  .207 (.074)  .328 (.113)  .326 (.109)  .305 (.074)
twonorm      .044 (.085)  .051 (.072)  .200 (.028)  .036 (.054)  .043 (.069)  .048 (.071)



Table 2. Labeling error rate for experiments with a class prior of p(y = 1) = 0.35 and
p'(y = 1) = 0.65. The size of each dataset was |X_p| = 40 and |X_p'| = 40. The best
method in terms of the mean error and comparable methods according to the
two-sided paired t-test at the significance level 5% are specified by bold face. The standard
deviation of the labeling error rate is given in brackets.

Dataset      DSDD         LSDD         KDE          KM           SC           SMIC
australian   .244 (.116)  .259 (.088)  .355 (.104)  .265 (.080)  .376 (.065)  .308 (.107)
banana       .338 (.094)  .339 (.100)  .365 (.067)  .433 (.049)  .427 (.069)  .424 (.070)
diabetes     .340 (.075)  .361 (.124)  .345 (.034)  .373 (.063)  .380 (.048)  .371 (.114)
german       .375 (.042)  .380 (.093)  .354 (.057)  .437 (.024)  .445 (.057)  .438 (.041)
heart        .270 (.133)  .247 (.084)  .354 (.052)  .264 (.059)  .315 (.081)  .327 (.089)
image        .331 (.078)  .350 (.067)  .350 (.039)  .384 (.031)  .354 (.049)  .382 (.050)
ionosphere   .291 (.099)  .356 (.066)  .345 (.048)  .330 (.070)  .322 (.058)  .314 (.107)
saheart      .378 (.093)  .353 (.057)  .363 (.066)  .419 (.082)  .395 (.022)  .385 (.040)
thyroid      .227 (.098)  .251 (.087)  .302 (.022)  .326 (.061)  .329 (.047)  .307 (.076)
twonorm      .164 (.188)  .153 (.121)  .352 (.096)  .036 (.053)  .042 (.122)  .049 (.120)



5 Conclusion 

The problem of unsupervised labeling of two unbalanced datasets was considered.
We first showed that this problem can be solved if two unlabeled datasets
having different class balances are available. The solution can be obtained by
estimating the sign of the difference between probability densities. We introduced
a method to directly estimate the sign of the density difference and avoid
density estimation. The method was shown on various datasets to outperform
competing methods that either estimate the density difference or use the cluster
structure of the data.

Because the sign of the density difference corresponds to the Bayes optimal
classifier under equal class balance, it may be estimated by any classifier that separates
$X_p$ and $X_{p'}$. Following this idea, we tested the support vector machine (SVM) for
estimating the sign of the density difference. However, this did not work well due to
the high overlap of $X_p$ and $X_{p'}$: both datasets are mixtures of the two classes,
only with different mixing ratios.

From this classification point of view, we can actually see that our objective
function (7) corresponds to the robust SVM [12] that minimizes the ramp loss
(a clipped hinge loss). Thanks to the robustness brought by the ramp loss, the
overlapped datasets $X_p$ and $X_{p'}$ can be separated more reliably, and thus we
obtained a good estimate of the sign of the density difference.

Furthermore, this view conversely shows that the robust SVM is actually a
suitable classification method because it directly estimates the Bayes optimal
classifier, the sign of the density difference. Labeling and classification are different
problems, but one can actually give insight into the other. In future work,
we will further investigate the relation between labeling and classification.
A Optimization 

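This section outlines the optimization of Eq. (7) using the convex-concave
procedure [7]. The non-convex function $R(z)$ can be rewritten as

\[ R(z) = C_{-1}(z) - C_{1}(z) - 1, \quad \text{where } C_{\epsilon}(z) = \max(0, z - \epsilon). \]

The convex part of the objective function can then be expressed as

\[ J_{\mathrm{vex}}(\boldsymbol{\alpha}) = \frac{1}{n'} \sum_{i=1}^{n'} C_{-1}\left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x'_i) \right) + \frac{1}{n} \sum_{j=1}^{n} C_{1}\left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j) \right) + \frac{\lambda}{2} \sum_{\ell=1}^{b} \alpha_\ell^2, \]

and the concave part as

\[ J_{\mathrm{cave}}(\boldsymbol{\alpha}) = -\frac{1}{n'} \sum_{i=1}^{n'} C_{1}\left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x'_i) \right) - \frac{1}{n} \sum_{j=1}^{n} C_{-1}\left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j) \right). \]

The following self-evident relation can be used to bound the concave part:

\[ tz - \varphi(t) \le \sup_{y} \left[ yz - \varphi(y) \right] \quad \Longrightarrow \quad \varphi(t) \ge tz - \varphi^{*}(z), \]

where

\[ \varphi^{*}(z) = \sup_{y} \left[ yz - \varphi(y) \right] \]

is known as the convex conjugate. The convex conjugate of the function $C_{\epsilon}(z)$ is

\[ C_{\epsilon}^{*}(z) = \begin{cases} \infty & z < 0, \\ \epsilon z & 0 \le z \le 1, \\ \infty & z > 1. \end{cases} \]

This gives an upper bound on the concave function as

\[ \bar{J}_{\mathrm{cave}}(\boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{c}) = \frac{1}{n'} \sum_{i=1}^{n'} \left( C_{1}^{*}(b_i) - b_i \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x'_i) \right) + \frac{1}{n} \sum_{j=1}^{n} \left( C_{-1}^{*}(c_j) - c_j \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j) \right), \]

where $\boldsymbol{b} = [b_1 \ b_2 \ \cdots \ b_{n'}]$ and $\boldsymbol{c} = [c_1 \ c_2 \ \cdots \ c_n]$ specify the bound.

A.1 Tightening the bound

The bound can be tightened around $\boldsymbol{\alpha}$ by minimizing $\bar{J}_{\mathrm{cave}}(\boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{c})$ w.r.t. $\boldsymbol{b}$ and $\boldsymbol{c}$.
To ensure that we have a non-trivial bound, we can explicitly write the conjugate
as constraints,

\[ \min_{\boldsymbol{b}, \boldsymbol{c}} \ \frac{1}{n'} \sum_{i=1}^{n'} \left( b_i - b_i \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x'_i) \right) + \frac{1}{n} \sum_{j=1}^{n} \left( -c_j - c_j \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j) \right) \]

\[ \text{s.t.} \quad 0 \le b_i \le 1, \quad 0 \le c_j \le 1. \]

The above optimization problem is separable in all unknowns, and the optimal
value can be obtained by Eq. (8).

A.2 Minimizing the upper bound

The upper bound of the objective function with fixed $\boldsymbol{b}$ and $\boldsymbol{c}$ is

\[ J_{\mathrm{vex}}(\boldsymbol{\alpha}) + \bar{J}_{\mathrm{cave}}(\boldsymbol{\alpha}, \boldsymbol{b}, \boldsymbol{c}). \]

By replacing each function $C_{\epsilon}(z)$ with a slack variable $\xi_i$ and the constraints

\[ \xi_i \ge 0, \quad \xi_i \ge z - \epsilon, \]

we obtain the objective function in Eq. (9).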
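The two identities this appendix relies on can be verified numerically. The sketch below uses hypothetical function names. It checks that the clipping function equals the difference of two hinge functions minus one, and that the conjugate construction gives a linear lower bound on $C_{\epsilon}$ for any slope in $[0, 1]$, so its negation upper-bounds the concave terms.

```python
def hinge(z, eps):
    """C_eps(z) = max(0, z - eps)."""
    return max(0.0, z - eps)

def ramp(z):
    """Clipping function R(z): 1 above 1, -1 below -1, identity in between."""
    return max(-1.0, min(1.0, z))

def ramp_from_hinges(z):
    """R(z) written as C_{-1}(z) - C_1(z) - 1."""
    return hinge(z, -1.0) - hinge(z, 1.0) - 1.0

def conjugate_bound(t, y, eps):
    """Linear lower bound t*y - eps*y on C_eps(t), valid for 0 <= y <= 1."""
    return t * y - eps * y

zs = [k / 4.0 for k in range(-12, 13)]   # grid from -3.0 to 3.0
max_gap = max(abs(ramp(z) - ramp_from_hinges(z)) for z in zs)
bound_holds = all(
    hinge(t, eps) >= conjugate_bound(t, y, eps)
    for t in zs for y in (0.0, 0.5, 1.0) for eps in (-1.0, 1.0)
)
```

The bound is tight when the slope is 1 for $t > \epsilon$ and 0 for $t < \epsilon$, which is the same 0/1 choice that Eq. (8) makes.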

B Least-squares estimation of the density difference 

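In [9] it was proposed to directly estimate the density difference by fitting a
model $g(x)$ to the true density difference $f(x) = p(x) - p'(x)$ under a square loss:

\[ \operatorname*{argmin}_{g} \frac{1}{2} \int \left( g(x) - [p(x) - p'(x)] \right)^2 dx. \]

The density difference was modeled by a linear-in-parameter model $g(x)$:

\[ g(x) = \sum_{\ell=1}^{b} \theta_\ell \psi_\ell(x) = \boldsymbol{\theta}^\top \boldsymbol{\psi}(x), \tag{10} \]

where $b$ denotes the number of basis functions, $\boldsymbol{\psi}(x) = (\psi_1(x), \ldots, \psi_b(x))^\top$
is a $b$-dimensional basis function vector, $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_b)^\top$ is a $b$-dimensional
parameter vector, and $^\top$ denotes the transpose. A Gaussian kernel model is used
to model the density difference:

\[ g(x) = \sum_{\ell=1}^{n+n'} \theta_\ell \exp\left( -\frac{\|x - c_\ell\|^2}{2\sigma^2} \right), \]

where $(c_1, \ldots, c_n, c_{n+1}, \ldots, c_{n+n'}) := (x_1, \ldots, x_n, x'_1, \ldots, x'_{n'})$ are Gaussian
kernel centers. For the model in Eq. (10), the optimal parameter $\boldsymbol{\theta}^{*}$ is given
by

\[ \boldsymbol{\theta}^{*} = \operatorname*{argmin}_{\boldsymbol{\theta}} \frac{1}{2} \int \left( g(x) - [p(x) - p'(x)] \right)^2 dx \]
\[ = \operatorname*{argmin}_{\boldsymbol{\theta}} \left[ \frac{1}{2} \int g(x)^2 dx - \int g(x)\,[p(x) - p'(x)] \, dx \right] \]
\[ = \operatorname*{argmin}_{\boldsymbol{\theta}} \left[ \frac{1}{2} \boldsymbol{\theta}^\top H \boldsymbol{\theta} - \boldsymbol{h}^\top \boldsymbol{\theta} \right], \]

where $H$ is the $b \times b$ matrix and $\boldsymbol{h}$ is the $b$-dimensional vector defined as

\[ H := \int \boldsymbol{\psi}(x) \boldsymbol{\psi}(x)^\top dx, \quad \boldsymbol{h} := \int \boldsymbol{\psi}(x)\, p(x) \, dx - \int \boldsymbol{\psi}(x)\, p'(x) \, dx. \]

For the Gaussian kernel model, the integral in $H$ can be computed analytically
as

\[ H_{\ell, \ell'} = \int \exp\left( -\frac{\|x - c_\ell\|^2}{2\sigma^2} \right) \exp\left( -\frac{\|x - c_{\ell'}\|^2}{2\sigma^2} \right) dx = (\pi\sigma^2)^{d/2} \exp\left( -\frac{\|c_\ell - c_{\ell'}\|^2}{4\sigma^2} \right), \]

where $d$ is the dimensionality of $x$.

Replacing the expectations in $\boldsymbol{h}$ by empirical estimators and adding an
$\ell_2$-regularizer to the objective function, we arrive at the following optimization
problem:

\[ \hat{\boldsymbol{\theta}} := \operatorname*{argmin}_{\boldsymbol{\theta}} \left[ \frac{1}{2} \boldsymbol{\theta}^\top H \boldsymbol{\theta} - \hat{\boldsymbol{h}}^\top \boldsymbol{\theta} + \frac{\lambda}{2} \boldsymbol{\theta}^\top \boldsymbol{\theta} \right], \tag{11} \]

where $\lambda \ (\ge 0)$ is the regularization parameter and $\hat{\boldsymbol{h}}$ is the $b$-dimensional vector
defined as

\[ \hat{\boldsymbol{h}} = \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{\psi}(x_i) - \frac{1}{n'} \sum_{j=1}^{n'} \boldsymbol{\psi}(x'_j). \]

Taking the derivative of the objective function in Eq. (11) and equating it to
zero, we can obtain the solution $\hat{\boldsymbol{\theta}}$ analytically as

\[ \hat{\boldsymbol{\theta}} = (H + \lambda I_b)^{-1} \hat{\boldsymbol{h}}, \]

where $I_b$ denotes the $b$-dimensional identity matrix. Finally, the density difference
estimator is

\[ \hat{f}(x) = \hat{\boldsymbol{\theta}}^\top \boldsymbol{\psi}(x). \]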
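The closed-form solution above is short enough to sketch without external libraries. The code below is an illustration under assumptions: it handles only one-dimensional inputs ($d = 1$), fixes $\sigma$ and $\lambda$ by hand rather than by cross-validation, and uses a small hand-written Gaussian elimination routine (`solve`) in place of a linear-algebra library. The names `lsdd_fit` and `solve` are hypothetical.

```python
import math

def solve(a, rhs):
    """Solve the linear system a x = rhs by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def lsdd_fit(x_p, x_p_dash, sigma, lam):
    """Return the density-difference estimate theta^T psi(x) for 1-D samples."""
    centers = list(x_p) + list(x_p_dash)
    b = len(centers)

    def psi(x):
        return [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]

    # Analytic H for d = 1: (pi sigma^2)^(1/2) * exp(-(c_l - c_l')^2 / (4 sigma^2)).
    scale = math.sqrt(math.pi) * sigma
    h_mat = [[scale * math.exp(-((ci - cj) ** 2) / (4.0 * sigma ** 2))
              for cj in centers] for ci in centers]
    # Empirical h: mean of psi over X_p minus mean of psi over X_p'.
    h_vec = [0.0] * b
    for x in x_p:
        for l, v in enumerate(psi(x)):
            h_vec[l] += v / len(x_p)
    for x in x_p_dash:
        for l, v in enumerate(psi(x)):
            h_vec[l] -= v / len(x_p_dash)
    regularised = [[h_mat[i][j] + (lam if i == j else 0.0) for j in range(b)]
                   for i in range(b)]
    theta = solve(regularised, h_vec)
    return lambda x: sum(t * v for t, v in zip(theta, psi(x)))

# Two well-separated toy sets (illustrative values only).
f_hat = lsdd_fit([-3.5, -3.0, -2.5], [2.5, 3.0, 3.5], sigma=1.0, lam=0.1)
f_left, f_right = f_hat(-3.0), f_hat(3.0)
```

Taking the sign of `f_hat` at a point then gives the LSDD-based label described in Section 2.4.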



